Automated data ingestion using an autoencoder

ABSTRACT

Systems, methods, apparatuses, and computer program products for processing data using an autoencoder. In one example, the autoencoder may receive streaming data comprising numeric values during a first time interval. The autoencoder may determine, during the first time interval, a maximum value and a minimum value of a first subset of the numeric values. The autoencoder may then process, during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/549,465, titled “AUTOMATED DATA INGESTION USING AN AUTOENCODER” filed on Aug. 23, 2019. The contents of the aforementioned application are incorporated herein by reference in their entirety.

TECHNICAL FIELD

Embodiments disclosed herein generally relate to deep learning, and more specifically, to training an autoencoder to perform automated data ingestion.

BACKGROUND

Input data is often received in different formats. Data engineering involves converting the format of input data to a desired format. However, data engineering is conventionally a manual process which requires significant time and resources. Furthermore, data engineering solutions are not portable, such that a new solution needs to be manually designed for different types of input data and/or desired output formats.

SUMMARY

Embodiments disclosed herein provide systems, methods, articles of manufacture, and computer-readable media for training an autoencoder to perform automated data ingestion. In one example, the autoencoder may receive streaming data comprising numeric values during a first time interval. The autoencoder may determine, during the first time interval, a maximum value and a minimum value of a first subset of the numeric values. The autoencoder may then process, during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an embodiment of a system that uses an autoencoder to perform automated data ingestion.

FIG. 2 illustrates an embodiment of training an autoencoder to perform automated data ingestion.

FIG. 3 illustrates an embodiment of a processing pipeline.

FIG. 4 illustrates an embodiment of a first logic flow.

FIG. 5 illustrates an embodiment of a second logic flow.

FIG. 6 illustrates an embodiment of a computing architecture.

DETAILED DESCRIPTION

Embodiments disclosed herein provide techniques to use an autoencoder to automatically format input data according to a desired output format. Generally, embodiments disclosed herein may sample a dataset. A statistical model (or other machine learning (ML) model) may format the data sampled from the dataset, thereby generating a formatted output dataset. A training dataset may then be used to train the autoencoder to format data. The training dataset may include the data sampled from the dataset as an input dataset and the formatted output dataset generated by the statistical model as an output dataset. The training dataset may include overlapping “chunks” such that the same data may appear in two or more chunks. Generally, during training, the autoencoder attempts to format the input dataset, thereby generating an output. The statistical model (or other ML model) may analyze the output of the autoencoder to determine an accuracy of the autoencoder. The determined accuracy of the autoencoder may then be used to train the values of a latent vector of the autoencoder. The training of the autoencoder may be repeated until the accuracy of the autoencoder exceeds a threshold. The trained autoencoder may then be used for data ingestion, e.g., by attaching the trained autoencoder to all new models and/or datasets.

Advantageously, embodiments disclosed herein provide techniques to automatically format data using an autoencoder. The autoencoder may be trained to appropriately format all data, even if the data has not been previously analyzed. Furthermore, embodiments disclosed herein provide scalable solutions that can be ported to any type of data processing pipeline, regardless of any particular input and/or output data formats. Further still, embodiments disclosed herein may train the autoencoder using only the training dataset and/or a portion thereof.

With general reference to notations and nomenclature used herein, one or more portions of the detailed description which follows may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and representations are used by those skilled in the art to most effectively convey the substances of their work to others skilled in the art. A procedure is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. These operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic, or optical signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to those quantities.

Further, these manipulations are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. However, no such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein that form part of one or more embodiments. Rather, these operations are machine operations. Useful machines for performing operations of various embodiments include digital computers as selectively activated or configured by a computer program stored within that is written in accordance with the teachings herein, and/or include apparatus specially constructed for the required purpose or a digital computer. Various embodiments also relate to apparatus or systems for performing these operations. These apparatuses may be specially constructed for the required purpose. The required structure for a variety of these machines will be apparent from the description given.

Reference is now made to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for the purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding thereof. It may be evident, however, that the novel embodiments can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate a description thereof. The intention is to cover all modifications, equivalents, and alternatives within the scope of the claims.

FIG. 1 depicts an exemplary system 100, consistent with disclosed embodiments. As shown, the system 100 includes a computing system 101. The computing system 101 is representative of any type of computing system, such as servers, compute clusters, desktop computers, smartphones, tablet computers, wearable devices, laptop computers, workstations, portable gaming devices, virtualized computing systems, and the like. The computing system 101 includes a processor 102, a memory 103, and may further include a storage, network interface, and/or other components not pictured for the sake of clarity.

As shown, the memory 103 includes an autoencoder 104, a machine learning (ML) model 105, a statistical model 106, and data stores of training data 107 and formatted data 108. The autoencoder 104 is representative of any type of autoencoder, including variational autoencoders, denoising autoencoders, sparse autoencoders, and contractive autoencoders. Generally, an autoencoder is a type of artificial neural network that learns data codings (e.g., the latent vector 109) in an unsupervised manner. Values of the latent vector 109 (also referred to as a code, coding, latent variables, and/or latent representation) may be learned (or refined) during training of the autoencoder 104, thereby training the autoencoder 104 to format input data according to a desired output format (which may include formatting according to a desired operation). Stated differently, the trained autoencoder 104 may approximate any function and/or operation applied to input data. As one example, the autoencoder 104 may convert input data comprising integer values to floating point values. More generally, the autoencoder 104 may perform any encoding operation, which may include, but is not limited to, normalizing values of input data, computing a z-score (e.g., a signed value reflecting a number of standard deviations the value of input data is from a mean value) for values of input data, standardizing values of input data, recasting values of input data, filtering the input data according to one or more filtering criteria, fuzzing of the values of input data, applying statistical filters to the input data, and the like. The use of any particular type of encoding operation as a reference example herein should not be considered limiting of the disclosure, as the disclosure is equally applicable to all types of encoding operations. Similarly, the use of the term “vector” to describe the latent vector 109 should not be considered limiting of the disclosure, as the latent vector 109 is also representative of a matrix having multiple dimensions (e.g., a vector of vectors).
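By way of illustration only, two of the encoding operations named above (normalizing values and computing a z-score) may be sketched in Python as follows. The function names `min_max_normalize` and `z_scores` are hypothetical and do not appear in the disclosure, and the use of the population standard deviation is an assumption. The sketch shows the kind of target function the autoencoder 104 may learn to approximate, not the autoencoder itself.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map every value to 0.0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Signed number of standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


raw = [10, 20, 30, 40, 50]          # integer input data
normalized = min_max_normalize(raw)  # floating point values in [0.0, 1.0]
scores = z_scores(raw)               # the mean value 30 receives a z-score of 0.0
```

In this example the integer inputs are recast to floating point values as a side effect of the division, which corresponds to the integer-to-floating-point conversion mentioned above.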

To train the autoencoder 104, one or more datasets of training data 107 may be generated. In one embodiment, the training data 107 comprises columnar and/or row-based data, e.g., one or more columns of integer values, one or more columns of floating point values, etc. Generally, the training data 107 may be representative of multiple datasets of any size. For example, the training data 107 may include 50 column-based datasets, where each dataset has thousands of records (or more). Furthermore, the training data 107 may be segmented (e.g., the training data 107 may comprise a plurality of segments of one or more datasets). In one embodiment, each segmented dataset of training data 107 is overlapping, such that at least one value of the training data 107 appears in at least two segments. For example, a first dataset may include rows 0-1000 of the training data 107, while a second dataset may include rows 900-2000 of the training data 107, such that rows 900-1000 appear in the first and second datasets. In one embodiment, the size of the datasets may be learned based on hyperparameter tuning.
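As a minimal sketch of the overlapping segmentation described above, the following Python function splits a list of rows into fixed-size segments that share a configurable number of rows with their neighbors. The function name `overlapping_segments` and its parameters `segment_size` and `overlap` are hypothetical, and the use of a fixed segment size is an assumption; the disclosure permits segments of differing sizes.

```python
def overlapping_segments(rows, segment_size, overlap):
    """Split rows into segments in which adjacent segments share `overlap` rows."""
    if not 0 <= overlap < segment_size:
        raise ValueError("overlap must be non-negative and smaller than segment_size")
    step = segment_size - overlap
    segments = []
    start = 0
    while start < len(rows):
        segments.append(rows[start:start + segment_size])
        if start + segment_size >= len(rows):
            break  # the last segment already reaches the end of the data
        start += step
    return segments


rows = list(range(2000))  # stand-in for 2,000 rows of training data
segments = overlapping_segments(rows, segment_size=1000, overlap=100)
# Rows shared by the first two segments.
shared = sorted(set(segments[0]) & set(segments[1]))
```

With these hypothetical parameters, the first segment holds rows 0-999 and the second segment begins at row 900, so rows 900-999 appear in both segments. The final segment may be shorter than `segment_size`.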

The ML model 105 and the statistical model 106 are representative of any type of computing model, such as deep learning models, machine learning models, neural networks, classifiers, clustering algorithms, support vector machines, and the like. In one embodiment, the ML model 105 and the statistical model 106 comprise the same model. Generally, the ML model 105 (and/or the statistical model 106) may be configured to transform (or encode) input data to a target format, thereby generating an output dataset. For example, the ML model 105 may be configured to normalize integer values of input data to floating point values, and the output dataset may comprise the floating point values. Generally, the ML model 105 may compute an output dataset for each input dataset of training data 107. An input dataset and corresponding formatted output dataset generated by the ML model 105 may be referred to as a “training sample” herein.

The autoencoder 104 may then be trained using the input dataset of one or more training samples. Generally, the autoencoder 104 may receive the input dataset as input, convert the dataset to an encoded format using the values of the latent vector 109, and decode the converted dataset. In some embodiments, the converted dataset generated by the autoencoder 104 may then be compared to the formatted data of the training sample generated by the ML model 105. The comparison may include determining a difference and/or least squared error of the converted dataset generated by the autoencoder 104 and the formatted data of the training sample generated by the ML model 105. Doing so generates one or more values reflecting an accuracy of the autoencoder 104. In some embodiments, the accuracy may comprise a loss of the autoencoder 104.
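The comparison described above may be illustrated with the following Python sketch, which computes a sum of squared errors and its mean between a converted dataset and the corresponding formatted data. The function names and the sample values are hypothetical; in practice the two lists would be the output of the autoencoder 104 and the output of the ML model 105 for the same input dataset.

```python
def squared_error(predicted, target):
    """Sum of squared differences between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def mean_squared_error(predicted, target):
    """Squared error averaged over the number of values."""
    return squared_error(predicted, target) / len(target)


# Hypothetical values: formatted data from the ML model and the
# converted dataset produced by a partially trained autoencoder.
formatted_data = [0.0, 0.5, 1.0]
converted_data = [0.0, 0.5, 0.5]

sse = squared_error(converted_data, formatted_data)       # 0.25
mse = mean_squared_error(converted_data, formatted_data)  # 0.25 / 3
```

A smaller error corresponds to a more accurate autoencoder, so either value may serve as a loss.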

In some embodiments, the ML model 105 and/or the statistical model 106 may receive the converted data generated by the autoencoder 104 to determine the accuracy of the autoencoder 104 relative to the data of the training sample generated by the ML model 105. For example, the ML model 105 may process the converted data generated by the autoencoder 104 and compare the output to the formatted data of the training sample. In another embodiment, the statistical model 106 may classify the converted data generated by the autoencoder 104 and compare the classification to a classification of the input dataset of the training sample. For example, the statistical model 106 may classify the formatted output generated by the autoencoder 104 as a dataset of credit card data. If the statistical model 106 classifies the input dataset of the training sample as being credit card data, the statistical model 106 may compute a relatively high accuracy value for the autoencoder 104. If, however, the classification for the input dataset is for purchase order amounts, the statistical model 106 may compute a relatively low accuracy value for the autoencoder 104. In one embodiment, the statistical model 106 may compute the accuracy value for the autoencoder 104 based on a distance between the classifications in a data space, where the accuracy increases as the distance between the classifications decreases.
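One possible way to turn a distance between classifications into an accuracy value is sketched below in Python. The disclosure does not specify a distance measure or a mapping from distance to accuracy; the Euclidean distance, the mapping 1 / (1 + distance), the function name `classification_accuracy`, and the two-element class vectors are all assumptions chosen only so that accuracy increases as the distance decreases.

```python
import math


def classification_accuracy(output_class, input_class):
    """Map the distance between two class vectors to a value in (0, 1]."""
    distance = math.dist(output_class, input_class)  # Euclidean distance
    return 1.0 / (1.0 + distance)


# Hypothetical class vectors in a two-dimensional data space.
credit_card = [1.0, 0.0]
purchase_order = [0.0, 1.0]

same = classification_accuracy(credit_card, credit_card)          # 1.0
different = classification_accuracy(credit_card, purchase_order)  # below 1.0
```

Matching classifications yield the highest accuracy value, while classifications that lie farther apart yield lower values.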

The determined accuracy of the autoencoder 104 may then be used to refine the values of the latent vector 109 and/or other components of the autoencoder 104 via a backpropagation operation. The backpropagation may be performed using any feasible backpropagation algorithm. Generally, during backpropagation, the values of the latent vector 109 and/or the other components of the autoencoder 104 are refined based on the accuracy of the formatted output generated by the autoencoder 104. Doing so may result in a latent vector 109 that most accurately maps the input data to the desired output format.

The training of the autoencoder 104 may be repeated any number of times until the accuracy of the autoencoder 104 exceeds a threshold (and/or the loss of the autoencoder 104 is below a threshold). The autoencoder 104 may then be configured to ingest (e.g., format) data to be processed in any processing platform, such as a streaming data platform, thereby generating the formatted data 108. In some embodiments, the autoencoder 104 may perform estimated ingestion operations. For example, the autoencoder 104 may receive streaming data over a time interval. If the streaming data is of a reasonable size, the autoencoder 104 may perform a predictive formatting operation on the streaming data. For example, by ingesting enough streaming data during the time interval, the autoencoder 104 may determine the minimum and maximum values therein. Doing so may allow the autoencoder 104 to normalize the streaming data in a predictive fashion in a single pass. Stated differently, the autoencoder 104 may normalize the streaming data in a single processing phase, rather than having to process the streaming data twice (e.g., to discover the minimum/maximum values, then normalize the data based on the identified minimum/maximum values).
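The single-pass idea described above may be sketched in plain Python, without an autoencoder, as follows. A first subset of the stream is used to estimate the minimum and maximum values, and the remaining values are normalized against those estimates as they arrive. The function name `normalize_stream`, the `warmup_size` parameter, and the clamping of out-of-range values to [0.0, 1.0] are assumptions made for illustration and are not taken from the disclosure.

```python
from itertools import islice


def normalize_stream(stream, warmup_size):
    """Estimate min/max from the first values, then normalize the rest in one pass."""
    iterator = iter(stream)
    warmup = list(islice(iterator, warmup_size))  # first subset of the stream
    if not warmup:
        return
    lo, hi = min(warmup), max(warmup)
    span = hi - lo
    for value in iterator:  # second subset, each value visited exactly once
        if span == 0:
            yield 0.0
        else:
            # Clamp values that fall outside the estimated range (an assumption).
            yield min(1.0, max(0.0, (value - lo) / span))


stream = [0, 10, 5, 2, 8, 12]  # hypothetical numeric streaming data
normalized = list(normalize_stream(stream, warmup_size=3))
# The estimated range is 0 to 10; the value 12 exceeds it and is clamped to 1.0.
```

Because the range is estimated rather than exact, later values may fall outside it, which is why this sketch clamps the result. A trained autoencoder would instead produce the estimated normalization directly.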

FIG. 2 is a schematic 200 illustrating an embodiment of training the autoencoder 104 to perform automated data ingestion. As shown, at block 201, one or more datasets of training data 107 may be segmented. The training data 107 may include row-based data and/or column-based data. The segments may have a minimum size (e.g., 10,000 rows and/or columns of data). In some embodiments, one or more of the segments may be modified, for example, by dropping one or more columns of data, formatting one or more columns of data, and the like. Doing so may produce varying segments of training data 107, e.g., where a first segment has had a column dropped, a second segment has had a column formatted, a third segment has had one column dropped and one column formatted, and a fourth segment has not been modified.

At block 202, the ML model 105 may process the segmented training data 107 to format the segmented training data 107 according to one or more formatting rules and/or operations. For example, the ML model 105 may normalize, convert, and/or filter the segmented training data 107. At block 203, one or more output datasets generated by the ML model 105 at block 202 may be stored. The output datasets may include each segment of training data 204 and the corresponding formatted data 205 generated by the ML model 105 at block 202. For example, if 1,000 segments of training data were generated at block 201, the segmented training data 204 may include the 1,000 segments, while the formatted data 205 may include 1,000 formatted datasets generated by the ML model 105 by processing each segment at block 202. In such an example, 1,000 training samples may comprise the segmented training data as input data and the corresponding formatted data 205 generated by the ML model 105.
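The pairing of each segment with its formatted counterpart may be sketched in Python as follows. The function `format_segment` is a hypothetical stand-in for the ML model 105 that merely recasts integers to floating point values, and `build_training_samples` is likewise a name introduced only for this illustration.

```python
def format_segment(segment):
    """Hypothetical stand-in for the ML model: recast integers to floats."""
    return [float(value) for value in segment]


def build_training_samples(segments, formatter):
    """Pair every input segment with the formatted output for that segment."""
    return [(segment, formatter(segment)) for segment in segments]


# Three small hypothetical segments of training data.
segmented_training_data = [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
training_samples = build_training_samples(segmented_training_data, format_segment)

first_input, first_formatted = training_samples[0]
```

Each resulting tuple corresponds to one training sample, so the number of training samples equals the number of segments.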

At block 206, overlapping datasets may be generated using the training samples of segmented training data 204 and formatted data 205. Continuing with the previous example, the 1,000 training samples may be modified to include overlapping values. At block 207, the autoencoder 104 may be trained using the overlapping datasets generated at block 206. For example, the autoencoder 104 may process each input dataset (e.g., the segmented training data 204) of each training sample, e.g., to convert each of the input datasets of the training samples to a desired output format and/or based on a predefined operation. At block 208, the accuracy of the autoencoder 104 is determined based on the output generated by the autoencoder 104 at block 207. For example, a difference and/or a least squared error may be computed between the output of the autoencoder 104 based on the segmented training data 204 and the corresponding formatted data 205 generated by the ML model 105. The difference and/or least squared error may be used as accuracy values for the autoencoder 104.

As another example, the statistical model 106 may classify the output generated by the autoencoder 104 at block 207 and compare the generated classification to a classification of the corresponding segmented training data 204. For example, if the classification of the output generated by the autoencoder 104 at block 207 for a first overlapping segment of training data 204 matches a classification generated for the formatted data 205 corresponding to the first overlapping segment of training data 204, the statistical model 106 may compute a relatively high accuracy value for the autoencoder 104 for the first training sample.

The determined accuracy may be used to train the autoencoder 104 via a backpropagation operation. Doing so refines the values of the autoencoder 104, including the latent vector 109, based on the determined accuracy values for the autoencoder 104 and/or a loss of the autoencoder 104. Generally, the accuracy at block 208 may be determined for each training sample. Therefore, continuing with the previous example, the accuracy for each of the 1,000 training samples processed by the autoencoder 104 may be determined at block 208. Each of the 1,000 accuracy values may be provided to the autoencoder 104 to update the weights of the autoencoder 104, e.g., via 1,000 (or fewer) backpropagation operations.

FIG. 3 illustrates an embodiment of a processing pipeline 300. At block 301, streaming input data is received in the processing pipeline 300. The streaming input data may be any type of data, such as transaction data, stock ticker data, financial data, sensor data, and the like. In some embodiments, the streaming input data includes numeric values in one or more rows and/or columns. However, the streaming input data may have varying types and/or formats which may need to be modified to be compatible with various components of the processing pipeline. Therefore, at block 302, the trained autoencoder 104 may process the streaming input data. For example, the trained autoencoder 104 may format the streaming input data according to a desired output format, normalize the values of the streaming input data, compute a z-score for the streaming input data, standardize values of the streaming input data, recast values of the streaming input data, filter the streaming input data according to one or more filtering criteria, fuzz the values of the streaming input data, and the like. At block 303, one or more components of the processing pipeline process the output generated by the autoencoder 104 at block 302, e.g., the formatted and/or converted streaming input data. Advantageously, the autoencoder 104 may process the streaming data in a single pass, e.g., by providing estimated normalization, recasting, etc., and without having to process the streaming data in two or more passes.

FIG. 4 illustrates an embodiment of a logic flow 400. The logic flow 400 may be representative of some or all of the operations executed by one or more embodiments described herein. For example, the logic flow 400 may include some or all of the operations to provide automated data ingestion using an autoencoder. Embodiments are not limited in this context.

As shown, the logic flow 400 begins at block 410, where a target data format is determined for data. For example, the target format may specify a datatype (e.g., integers, floating points, etc.), a data space (e.g., a range of values), etc. More generally, any type of operation may be determined for the data at block 410, e.g., normalization, filtering, score computation, etc. At block 420, the autoencoder 104 is trained to format data according to the target formats and/or operations defined at block 410. Generally, the training of the autoencoder 104 is guided by the ML model 105 and/or the statistical model 106 as described in greater detail herein. At block 430, the accuracy of the autoencoder 104 may be determined to exceed a threshold accuracy level. For example, if the threshold is 90% accuracy, and the accuracy of the autoencoder 104 is 95%, the accuracy of the autoencoder may exceed the threshold. At block 440, the autoencoder 104 is configured to format data in a processing pipeline.

FIG. 5 illustrates an embodiment of a logic flow 500. The logic flow 500 may be representative of some or all of the operations executed by one or more embodiments described herein. For example, the logic flow 500 may include some or all of the operations performed to train the autoencoder 104. Embodiments are not limited in this context.

As shown, the logic flow 500 begins at block 510, where the training data 107, which may comprise one or more datasets, is segmented into overlapping training data subsets. As stated, the training data 107 may include row and/or column-based numerical values. By generating overlapping subsets, one or more values of the training data 107 may appear in two or more subsets. At block 520, the ML model 105 transforms the training data subsets according to the format defined at block 410. For example, the ML model 105 may be configured to transform the training data from a first format to a second format. More generally, the ML model 105 may perform any operation on the training data as described above. Doing so may generate a respective transformed output dataset for each of the training data subsets. Each training dataset and corresponding transformed output dataset pair may comprise a training sample for the autoencoder. One or more of the training samples may be selected at block 530.

At block 540, the autoencoder 104 may process the input dataset of the training sample selected at block 530. Generally, the autoencoder 104 may transform the input dataset of the training sample (or perform any other operation) based at least in part on the current weights of the latent vector 109. Doing so may generate a transformed output. At block 550, the accuracy of the autoencoder 104 is determined based at least in part on the transformed output generated by the autoencoder 104. As stated, the ML model 105 and/or the statistical model 106 may be used to determine the accuracy of the autoencoder 104. For example, a difference and/or a least squared error may be computed for the output of the autoencoder 104 based on the transformed output dataset of the training sample (e.g., the output of the ML model 105) and the output generated by the autoencoder 104 at block 540. The difference and/or least squared error may be used as accuracy values for the autoencoder 104. As another example, the statistical model 106 may classify the output generated by the autoencoder 104 at block 540 and compare the generated classification to a classification of the training data of the input sample selected at block 530. The accuracy of the autoencoder 104 may then be determined based on a similarity of the classifications, where more similar classifications result in higher accuracy values for the autoencoder 104.

At block 560, the accuracy determined at block 550 may be provided to the autoencoder 104. At block 570, the values of the latent vector 109 and any other values of the autoencoder 104 may be refined during a backpropagation operation. Doing so may allow the values of the latent vector 109 to more accurately reflect a mapping required to perform the desired operation on data (e.g., filtering, formatting, recasting, etc.). If the accuracy of the autoencoder 104 determined at block 550 is lower than a threshold accuracy, the logic flow 500 may return to block 530, where another training sample is selected, thereby repeating the training process until the accuracy of the autoencoder 104 exceeds the threshold. Once the accuracy of the autoencoder 104 exceeds a threshold and/or all training samples have been used to train the autoencoder 104, the logic flow 500 may end.
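The overall shape of the loop in logic flow 500 (process the training samples, measure the error, refine the learned values, and stop once a threshold is met) may be illustrated with the deliberately simplified Python sketch below. It is not an autoencoder: a single learned weight stands in for the latent vector 109, plain gradient descent stands in for backpropagation, and the function `teacher` is a hypothetical stand-in for the ML model 105. The learning rate, loss threshold, and sample values are assumptions.

```python
def teacher(x):
    """Hypothetical stand-in for the ML model: the target formatting operation."""
    return 0.25 * x


def train(samples, learning_rate=0.05, loss_threshold=1e-6, max_epochs=1000):
    """Refine one weight until the mean squared error drops below the threshold."""
    weight = 0.0  # stand-in for the learned values of the latent vector
    loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        loss = 0.0
        gradient = 0.0
        for x, target in samples:
            error = weight * x - target  # model output minus formatted data
            loss += error ** 2
            gradient += 2.0 * error * x
        loss /= len(samples)
        gradient /= len(samples)
        if loss < loss_threshold:
            return weight, loss, epoch  # threshold met, training ends
        weight -= learning_rate * gradient  # refine the learned value
    return weight, loss, max_epochs


inputs = [1.0, 2.0, 3.0, 4.0]
samples = [(x, teacher(x)) for x in inputs]  # (input, formatted output) pairs
weight, loss, epochs = train(samples)
```

In this sketch the stopping condition is expressed as a loss falling below a threshold, which mirrors the accuracy exceeding a threshold in the description above. The learned weight approaches 0.25, the value used by the hypothetical teacher.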

FIG. 6 illustrates an embodiment of an exemplary computing architecture 600 comprising a computing system 602 that may be suitable for implementing various embodiments as previously described. In various embodiments, the computing architecture 600 may comprise or be implemented as part of an electronic device. In some embodiments, the computing architecture 600 may be representative, for example, of a system that implements one or more components of the system 100. In some embodiments, computing system 602 may be representative, for example, of the computing system 101 of the system 100. The embodiments are not limited in this context. More generally, the computing architecture 600 is configured to implement all logic, applications, systems, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.

As used in this application, the terms “system” and “component” and “module” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution, examples of which are provided by the exemplary computing architecture 600. For example, a component can be, but is not limited to being, a process running on a computer processor, a computer processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the uni-directional or bi-directional exchange of information. For instance, the components may communicate information in the form of signals communicated over the communications media. The information can be implemented as signals allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.

The computing system 602 includes various common computing elements, such as one or more processors, multi-core processors, co-processors, memory units, chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components, power supplies, and so forth. The embodiments, however, are not limited to implementation by the computing system 602.

As shown in FIG. 6, the computing system 602 comprises a processor 604, a system memory 606 and a system bus 608. The processor 604 can be any of various commercially available computer processors, including without limitation AMD® Athlon®, Duron® and Opteron® processors; ARM® application, embedded and secure processors; IBM® and Motorola® DragonBall® and PowerPC® processors; IBM and Sony® Cell processors; Intel® Celeron®, Core®, Core (2) Duo®, Itanium®, Pentium®, Xeon®, and XScale® processors; and similar processors. Dual microprocessors, multi-core processors, and other multi-processor architectures may also be employed as the processor 604.

The system bus 608 provides an interface for system components including, but not limited to, the system memory 606 to the processor 604. The system bus 608 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. Interface adapters may connect to the system bus 608 via a slot architecture. Example slot architectures may include without limitation Accelerated Graphics Port (AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel Architecture (MCA), NuBus, Peripheral Component Interconnect (Extended) (PCI(X)), PCI Express, Personal Computer Memory Card International Association (PCMCIA), and the like.

The system memory 606 may include various types of computer-readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory (e.g., one or more flash arrays), polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory, solid state drives (SSD)), and any other type of storage media suitable for storing information. In the illustrated embodiment shown in FIG. 6, the system memory 606 can include non-volatile memory 610 and/or volatile memory 612. A basic input/output system (BIOS) can be stored in the non-volatile memory 610.

The computing system 602 may include various types of computer-readable storage media in the form of one or more lower speed memory units, including an internal (or external) hard disk drive (HDD) 614, a magnetic floppy disk drive (FDD) 616 to read from or write to a removable magnetic disk 618, and an optical disk drive 620 to read from or write to a removable optical disk 622 (e.g., a CD-ROM or DVD). The HDD 614, FDD 616 and optical disk drive 620 can be connected to the system bus 608 by a HDD interface 624, an FDD interface 626 and an optical drive interface 628, respectively. The HDD interface 624 for external drive implementations can include at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies. The computing system 602 is generally configured to implement all logic, systems, methods, apparatuses, and functionality described herein with reference to FIGS. 1-5.

The drives and associated computer-readable media provide volatile and/or nonvolatile storage of data, data structures, computer-executable instructions, and so forth. For example, a number of program modules can be stored in the drives and memory units 610, 612, including an operating system 630, one or more application programs 632, other program modules 634, and program data 636. In one embodiment, the one or more application programs 632, other program modules 634, and program data 636 can include, for example, the various applications and/or components of the system 100, e.g., the autoencoder 104, ML model 105, statistical model 106, training data 107, formatted data 108, and latent vector 109.

A user can enter commands and information into the computing system 602 through one or more wire/wireless input devices, for example, a keyboard 638 and a pointing device, such as a mouse 640. Other input devices may include microphones, infra-red (IR) remote controls, radio-frequency (RF) remote controls, game pads, stylus pens, card readers, dongles, finger print readers, gloves, graphics tablets, joysticks, keyboards, retina readers, touch screens (e.g., capacitive, resistive, etc.), trackballs, trackpads, sensors, styluses, and the like. These and other input devices are often connected to the processor 604 through an input device interface 642 that is coupled to the system bus 608, but can be connected by other interfaces such as a parallel port, IEEE 1394 serial port, a game port, a USB port, an IR interface, and so forth.

A monitor 644 or other type of display device is also connected to the system bus 608 via an interface, such as a video adaptor 646. The monitor 644 may be internal or external to the computing system 602. In addition to the monitor 644, a computer typically includes other peripheral output devices, such as speakers, printers, and so forth.

The computing system 602 may operate in a networked environment using logical connections via wire and/or wireless communications to one or more remote computers, such as a remote computer 648. The remote computer 648 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all of the elements described relative to the computing system 602, although, for purposes of brevity, only a memory/storage device 650 is illustrated. The logical connections depicted include wire/wireless connectivity to a local area network (LAN) 652 and/or larger networks, for example, a wide area network (WAN) 654. Such LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, for example, the Internet.

When used in a LAN networking environment, the computing system 602 is connected to the LAN 652 through a wire and/or wireless communication network interface or adaptor 656. The adaptor 656 can facilitate wire and/or wireless communications to the LAN 652, which may also include a wireless access point disposed thereon for communicating with the wireless functionality of the adaptor 656.

When used in a WAN networking environment, the computing system 602 can include a modem 658, or is connected to a communications server on the WAN 654, or has other means for establishing communications over the WAN 654, such as by way of the Internet. The modem 658, which can be internal or external and a wire and/or wireless device, connects to the system bus 608 via the input device interface 642. In a networked environment, program modules depicted relative to the computing system 602, or portions thereof, can be stored in the remote memory/storage device 650. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.

The computing system 602 is operable to communicate with wired and wireless devices or entities using the IEEE 802 family of standards, such as wireless devices operatively disposed in wireless communication (e.g., IEEE 802.16 over-the-air modulation techniques). This includes at least Wi-Fi (or Wireless Fidelity), WiMax, and Bluetooth™ wireless technologies, among others. Thus, the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices. Wi-Fi networks use radio technologies called IEEE 802.11x (a, b, g, n, etc.) to provide secure, reliable, fast wireless connectivity. A Wi-Fi network can be used to connect computers to each other, to the Internet, and to wire networks (which use IEEE 802.3-related media and functions).

Various embodiments may be implemented using hardware elements, software elements, or a combination of both. Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores”, may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor. Some embodiments may be implemented, for example, using a machine-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software. The machine-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, Compact Disk Read Only Memory (CD-ROM), Compact Disk Recordable (CD-R), Compact Disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), a tape, a cassette, or the like. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.

The foregoing description of example embodiments has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present disclosure to the precise forms disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the present disclosure be limited not by this detailed description, but rather by the claims appended hereto. Future filed applications claiming priority to this application may claim the disclosed subject matter in a different manner, and may generally include any set of one or more limitations as variously disclosed or otherwise demonstrated herein.

What is claimed is:
 1. A system, comprising: a processor circuit; and a memory storing instructions which when executed by the processor circuit, cause the processor circuit to: receive, by an autoencoder during a first time interval, streaming data comprising numeric values; determine, by the autoencoder during the first time interval, a maximum value and a minimum value of a first subset of the numeric values; and process, by the autoencoder during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
 2. The system of claim 1, wherein processing the second subset of the numeric values comprises normalizing the second subset of the numeric values to be within the determined maximum and minimum values.
 3. The system of claim 1, wherein processing the second subset of the numeric values comprises filtering a numeric value from the second subset that is not within the determined maximum and minimum values, wherein filtering the numeric value from the second subset removes the filtered numeric value from the second subset.
 4. The system of claim 1, wherein processing the second subset of the numeric values comprises converting a numeric value from the second subset that is not within the determined maximum and minimum values from a first data type to a second data type, wherein the converted numeric value of the second data type is within the determined maximum and minimum values.
 5. The system of claim 1, wherein the autoencoder is trained based on a training dataset generated by a computing model, wherein the autoencoder is trained to process numeric values according to a predefined operation, wherein the autoencoder comprises a latent vector.
 6. The system of claim 5, wherein an accuracy of the trained autoencoder exceeds a threshold accuracy.
 7. The system of claim 1, the memory storing instructions which when executed by the processor circuit, cause the processor circuit to: provide, by the autoencoder, the processed second subset of the numeric values to a processing pipeline.
 8. A non-transitory computer-readable storage medium storing instructions that when executed by a processor cause the processor to: receive, by an autoencoder during a first time interval, streaming data comprising numeric values; determine, by the autoencoder during the first time interval, a maximum value and a minimum value of a first subset of the numeric values; and process, by the autoencoder during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
 9. The medium of claim 8, wherein processing the second subset of the numeric values comprises normalizing the second subset of the numeric values to be within the determined maximum and minimum values.
 10. The medium of claim 8, wherein processing the second subset of the numeric values comprises filtering a numeric value from the second subset that is not within the determined maximum and minimum values, wherein filtering the numeric value from the second subset removes the filtered numeric value from the second subset.
 11. The medium of claim 8, wherein processing the second subset of the numeric values comprises converting a numeric value from the second subset that is not within the determined maximum and minimum values from a first data type to a second data type, wherein the converted numeric value of the second data type is within the determined maximum and minimum values.
 12. The medium of claim 8, wherein the autoencoder is trained based on a training dataset generated by a computing model, wherein the autoencoder is trained to process numeric values according to a predefined operation, wherein the autoencoder comprises a latent vector.
 13. The medium of claim 12, wherein an accuracy of the trained autoencoder exceeds a threshold accuracy.
 14. The medium of claim 8, storing instructions which when executed by the processor, cause the processor to: provide, by the autoencoder, the processed second subset of the numeric values to a processing pipeline.
 15. A method, comprising: receiving, by an autoencoder executing on a computer processor, streaming data comprising numeric values during a first time interval; determining, by the autoencoder during the first time interval, a maximum value and a minimum value of a first subset of the numeric values; and processing, by the autoencoder during the first time interval, a second subset of the numeric values based on the determined maximum and minimum values.
 16. The method of claim 15, wherein processing the second subset of the numeric values comprises normalizing the second subset of the numeric values to be within the determined maximum and minimum values.
 17. The method of claim 15, wherein processing the second subset of the numeric values comprises filtering a numeric value from the second subset that is not within the determined maximum and minimum values, wherein filtering the numeric value from the second subset removes the filtered numeric value from the second subset.
 18. The method of claim 15, wherein processing the second subset of the numeric values comprises converting a numeric value from the second subset that is not within the determined maximum and minimum values from a first data type to a second data type, wherein the converted numeric value of the second data type is within the determined maximum and minimum values.
 19. The method of claim 15, wherein the autoencoder is trained based on a training dataset generated by a computing model, wherein the autoencoder is trained to process numeric values according to a predefined operation, wherein the autoencoder comprises a latent vector, wherein an accuracy of the trained autoencoder exceeds a threshold accuracy.
 20. The method of claim 15, further comprising: providing, by the autoencoder, the processed second subset of the numeric values to a processing pipeline.
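
By way of non-limiting illustration only, the data flow recited in the claims above (determining a maximum and a minimum from a first subset, then processing a second subset based on those values) may be sketched as follows. This sketch is an assumption for explanatory purposes and is not the disclosed implementation: it uses plain Python functions as a stand-in for a trained autoencoder, the function names and sample values are hypothetical, and "normalizing ... to be within" the bounds is read here as clipping, which is only one possible interpretation.

```python
from typing import Iterable, List, Tuple


def determine_bounds(first_subset: Iterable[float]) -> Tuple[float, float]:
    """Return (minimum, maximum) of the first subset of numeric values."""
    values = list(first_subset)
    if not values:
        raise ValueError("first subset must contain at least one value")
    return min(values), max(values)


def clip_to_bounds(second_subset: Iterable[float], lo: float, hi: float) -> List[float]:
    """One reading of 'normalizing': force each value into [lo, hi]."""
    return [min(max(v, lo), hi) for v in second_subset]


def filter_to_bounds(second_subset: Iterable[float], lo: float, hi: float) -> List[float]:
    """Remove every value that is not within [lo, hi]."""
    return [v for v in second_subset if lo <= v <= hi]


# Hypothetical numeric values received during a single time interval.
window = [3.0, 7.0, 5.0, 9.5, 1.0, 6.0]
first_subset, second_subset = window[:3], window[3:]

lo, hi = determine_bounds(first_subset)             # bounds come from the first subset only
clipped = clip_to_bounds(second_subset, lo, hi)     # out-of-range values are moved to a bound
filtered = filter_to_bounds(second_subset, lo, hi)  # out-of-range values are dropped
```

In this sketch the bounds are computed once from the first subset and then reused for the second subset, so that both steps can occur within the same time interval; a trained autoencoder as described herein would learn the corresponding operation rather than apply it by explicit rule.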